
48 research outputs found

Towards The Efficient Use Of Fine-Grained Provenance In Datascience Applications

Author: Wu Yinjun
Publication venue: ScholarlyCommons
Publication date: 01/01/2021

Recent years have witnessed increased demand for users to be able to interpret the results of data science pipelines, locate erroneous data items in the input, evaluate the importance of individual input data items, and acknowledge the contributions of data curators. Such applications often involve the use of provenance at a fine-grained level and require very fast response times. To address this issue, my goal is to expedite the use of fine-grained provenance in applications within both the database and machine learning (ML) domains, which are ubiquitous in contemporary data science pipelines. In applications from the database domain, I focus on the problem of data citation and provide two types of solutions, rewriting-based and provenance-based, to generate fine-grained citations to database query results by implicitly or explicitly leveraging provenance information. In the ML domain, the first application considers the problem of incrementally updating ML models after the deletion of a small subset of training samples. This is critical for understanding the importance of individual training samples to ML models, especially in online pipelines. For this problem, I provide two solutions, PrIU and DeltaGrad, to incrementally update ML models constructed by SGD/GD methods; both utilize provenance information collected during the training phase on the full dataset before the deletion requests. The second application from the ML domain that I focus on is cleaning uncertain labels in the ML training dataset more efficiently and cheaply. To address this problem, I propose a solution, CHEF, that reduces the cost and overhead at each phase of the label cleaning pipeline while maintaining overall model performance. I also propose initial ideas for how to remove some assumptions used in these solutions to extend them to more general scenarios.

ScholarlyCommons@Penn

MDB: Interactively Querying Datasets and Models

Author: Naik Aaditya
Naik Mayur
Stein Adam
Wong Eric
Wu Yinjun
Publication date: 13/08/2023

As models are trained and deployed, developers need to be able to systematically debug errors that emerge in the machine learning pipeline. We present MDB, a debugging framework for interactively querying datasets and models. MDB integrates functional programming with relational algebra to build expressive queries over a database of datasets and model predictions. Queries are reusable and easily modified, enabling debuggers to rapidly iterate and refine queries to discover and characterize errors and model behaviors. We evaluate MDB on object detection, bias discovery, image classification, and data imputation tasks across self-driving videos, large language models, and medical records. Our experiments show that MDB enables up to 10x faster and 40% shorter queries than other baselines. In a user study, we find developers can successfully construct complex queries that describe errors of machine learning models.

arXiv.org e-Print Archive

Data Citation: A New Provenance Challenge

Author: Alawini Abdussalam
Davidson Susan B
Silvello Gianmaria
Tannen Val
Wu Yinjun
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

Archivio istituzionale della ricerca - Università di Padova

Dynamic Gaussian Mixture based Deep Generative Model For Robust Forecasting on Sparse Multivariate Time Series

Author: Chen Haifeng
Chen Zhengzhang
Cheng Wei
Davidson Susan
Liu Yanchi
Ni Jingchao
Song Dongjin
Wu Yinjun
Zhang Xuchao
Zong Bo
Publication date: 02/03/2021

Forecasting on sparse multivariate time series (MTS) aims to model the predictors of future values of time series given their incomplete past, which is important for many emerging applications. However, most existing methods process MTS's individually, and do not leverage the dynamic distributions underlying the MTS's, leading to sub-optimal results when the sparsity is high. To address this challenge, we propose a novel generative model, which tracks the transition of latent clusters, instead of isolated feature representations, to achieve robust modeling. It is characterized by a newly designed dynamic Gaussian mixture distribution, which captures the dynamics of clustering structures, and is used for emitting time series. The generative model is parameterized by neural networks. A structured inference network is also designed for enabling inductive analysis. A gating mechanism is further introduced to dynamically tune the Gaussian mixture distributions. Extensive experimental results on a variety of real-life datasets demonstrate the effectiveness of our method. Comment: This paper is accepted by AAAI 202

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Why data citation isn't working, and what to do about it

Author: Buneman Peter
Christie Greig
Davies Jamie A.
Dimitrellou Roza
Harding Simon D.
Pawson Adam J.
Sharman Joanna L.
Wu Yinjun
Publication venue: Oxford University Press (OUP)
Publication date: 02/05/2020

Edinburgh Research Explorer

Identification of Prognostic Genes and Pathways in Lung Adenocarcinoma Using a Bayesian Approach

Author: Du Yinhao
Huang Yuan
Jiang Yu
Ma Shuangge
Ren Jie
Wu Cen
Zhao Yinjun
Publication date: 01/01/2017

Lung cancer is the leading cause of cancer-associated mortality in the United States and the world. Adenocarcinoma, the most common subtype of lung cancer, is generally diagnosed at a late stage with poor prognosis. In the past, extensive effort has been devoted to elucidating lung cancer pathogenesis and pinpointing genes associated with survival outcomes. As the progression of lung cancer is a complex process that involves coordinated actions of functionally associated genes from cancer-related pathways, there is a growing interest in simultaneous identification of both prognostic pathways and important genes within those pathways. In this study, we analyse The Cancer Genome Atlas lung adenocarcinoma data using a Bayesian approach incorporating the pathway information as well as the interconnections among genes. The top 11 pathways have been found to play significant roles in lung adenocarcinoma prognosis, including pathways in mitogen-activated protein kinase signalling, cytokine-cytokine receptor interaction, and ubiquitin-mediated proteolysis. We have also located key gene signatures such as RELB, MAP4K1, and UBE2C. These results indicate that the Bayesian approach may facilitate discovery of important genes and pathways that are tightly associated with the survival of patients with lung adenocarcinoma.

University of Memphis Digital Commons

K-State Research Exchange

Influence of turbid flood water release on sediment deposition and phosphorus distribution in the bed sediment of the Three Gorges Reservoir, China

Author: Bowes Michael J.
Li Rui
Tang Xianqiang
Wu Min
Zhao Liangyuan
Zhao Weihua
Zhou Yinjun
Publication venue: Elsevier BV
Publication date: 01/03/2019

Excessive phosphorus (P) loading was identified as an urgent problem during the post-Three Gorges Reservoir (TGR) period. Turbid water with high suspended sediment loads has been periodically released during the flood season to mitigate sediment deposition in the TGR, but limited attention has been paid to its effect on the distribution of P in bed sediment within the reservoir. In this study, field surveys, historical monitoring data related to sediment deposition, and the physicochemical properties and fractional P content of the mainstream surface sediment and representative column sediment were used to investigate the effect of turbid flood water release on P distribution in bed sediment. The results revealed that turbid flood water release could discharge approximately 20% of the suspended sediment inflow entering the TGR. Additionally, both the particle size of the inflow sediment and the suspended sediment flux tended to decline, and the deposited sediment volume in the TGR increased steadily at a rate of 0.117 billion tonnes per year between 2004 and 2016. The median particle size (MPS) was larger for surface sediment obtained in the flood season than for that obtained in the dry season, and the MPS tended to increase with an increase in the sediment depth from 0 to 20 cm. The total phosphorus (TP) content in sediment was 2.6% to 17.5% lower in the flood water releasing period than in the non-flood water storing period. However, no consistent variation was detected for the vertical distribution of P fraction in the top 20 cm of bed sediment. Compared with lakes with slow deposition rates, the TGR showed a rapid sedimentation rate of >1.0 m/y, which mostly resulted in the uniform distribution of the surface sediment P fraction.

NERC Open Research Archive